Graphics Programming Hardware

Sorting Algorithm Breaks Giga-Sort Barrier, With GPUs 187

An anonymous reader writes "Researchers at the University of Virginia have recently open sourced an algorithm capable of sorting at a rate of one billion (integer) keys per second using a GPU. Although GPUs are often assumed to be poorly suited for algorithms like sorting, their results are several times faster than the best known CPU-based sorting implementations."
This discussion has been archived. No new comments can be posted.


Comments Filter:
  • Excel Charts (Score:3, Insightful)

    by Anonymous Coward on Sunday August 29, 2010 @09:30PM (#33412026)

    I find it very disappointing that a group of programmer/computer science types who even supply BibTeX to make it easier to reference their work, resort to screen-capturing an Excel chart to display their data.

    • by bramp ( 830799 ) on Sunday August 29, 2010 @09:59PM (#33412128) Homepage
      I've also found this annoying when reading papers. Perhaps I've just spent too much time learning how to use gnuplot to make my graphs look nice.
    • Re:Excel Charts (Score:4, Insightful)

      by Anpheus ( 908711 ) on Sunday August 29, 2010 @10:21PM (#33412208)

      Maybe Excel was just the right tool for the job? It's quick and easy to use, and to reformat graphs in.

      I know the Linux tools tend to have a longer loop between tweaking, rendering, and displaying, so a fast WYSIWYG tool works just fine.

    • Re:Excel Charts (Score:5, Insightful)

      by pspahn ( 1175617 ) on Monday August 30, 2010 @01:29AM (#33412798)
      Actually, I find it even more disappointing that a decent way to display datasets on a web page isn't standard yet. Why isn't there a nice embeddable one with column sorting and robust methods for retrieving data? There are solutions, sure, but I have yet to find one that isn't unnecessarily complex, or just plain ugly and difficult to use. But I guess it's just a matter of time, right?
    • Re:Excel Charts (Score:3, Insightful)

      by dominious ( 1077089 ) on Monday August 30, 2010 @05:54AM (#33413464)
      what. the. fuck. +4 Insightful because they use Excel charts?

      Hey I just solved P=NP.
      Yeah, but you are using Excel charts...hmmm sorry kthnx later.
    • Re:Excel Charts (Score:3, Insightful)

      by multipartmixed ( 163409 ) on Monday August 30, 2010 @09:40AM (#33414950) Homepage

      1. Why do you believe those are screen captures, rather than, say, exported images?

      2. How would the data look different if it were displayed with BibTeX?

      3. How fast is using BibTeX? (I've never used it). I could create those same charts in Excel '97 from a CSV of input points easily; probably in under a minute.

  • by black3d ( 1648913 ) on Sunday August 29, 2010 @09:33PM (#33412038)

    Specifically, a GTX480 (just over 1 B keys/sec), followed by a Tesla 2050 at around 75% of the GTX480's speed (745 M keys/sec).

  • by Anonymous Coward on Sunday August 29, 2010 @09:47PM (#33412082)

    GPUs have always been better at sorting your money from your wallet.

  • Not a barrier (Score:5, Insightful)

    by Captain Segfault ( 686912 ) on Sunday August 29, 2010 @09:49PM (#33412084) Homepage Journal

    This isn't a "barrier" like the "sound barrier". There are no special difficulties that start around 1G/sec! It's just a threshold.

    Don't get me wrong -- I'm not saying this isn't impressive, but no "barrier" was broken here!

  • Um... (Score:5, Insightful)

    by Anonymous Coward on Sunday August 29, 2010 @09:50PM (#33412088)

    Algorithms aren't measured in "x per second"... only implementations are measured that way. The speed of an algorithm is described in big-O notation, such as O(n log n). The metric of "sorted keys per second" is largely useless, because it depends on the particular hardware setup.

    • Re:Um... (Score:5, Interesting)

      by PrecambrianRabbit ( 1834412 ) on Sunday August 29, 2010 @10:19PM (#33412196)

      Given that the particular hardware setup is detailed here (a GTX 480 achieves the 1 billion keys/sec figure), and the algorithm used (radix sort) has known asymptotic behavior (O(nk) for n keys of length k), 10^9 keys/sec is quite meaningful, particularly since it's a significant implementation challenge (possibly even an algorithmic challenge) to port this algorithm to a GPU.

      Furthermore, I think sorting speed is appropriately measured in keys/sec. Big-O does not in fact describe the speed, but rather the upper bound of the growth of an algorithm's asymptotic running time, which needs to be paired with the implementation, architecture, and data set to determine a speed. It turns out the constant factors can actually be quite important in practice.
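      To make the asymptotics concrete, here is a minimal single-threaded sketch of an LSD radix sort (a hedged illustration only, not the researchers' CUDA code): each counting pass handles one digit, so the total work is O(nk) for n keys of k digits.

```python
def radix_sort(keys, bits=32, radix_bits=8):
    """LSD radix sort for non-negative integers: one pass per digit."""
    mask = (1 << radix_bits) - 1
    for shift in range(0, bits, radix_bits):
        # Scatter keys into buckets by the current digit...
        buckets = [[] for _ in range(1 << radix_bits)]
        for key in keys:
            buckets[(key >> shift) & mask].append(key)
        # ...then gather them back in bucket order (stable, so
        # earlier passes' ordering is preserved).
        keys = [key for bucket in buckets for key in bucket]
    return keys
```

      A GPU version replaces the sequential scatter/gather with parallel per-digit counting and prefix sums, which is where the implementation challenge lies.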

      • Also (Score:5, Interesting)

        by Sycraft-fu ( 314770 ) on Monday August 30, 2010 @12:30AM (#33412624)

        CS people get way too caught up in Big O, forgetting that, as you note, it describes a theoretical upper bound on running time, not actual practice, AND is only relevant for one factor, n. A great example is ray tracing. CS people love ray tracing because most create a ray tracer as part of their studies. They love to go on about how it is great for rendering because "It is O(log n)!" They love to hype it over rasterization like current GPUs do. However, there are two really important things they forget:

        1) What is n? In this case, polygons. It scales with the log of the number of polygons you have. This is why ray tracer demos love showing off spheres made of millions of polygons and so on. It is cheap to do. However, it turns out polygons aren't the only thing that matters for graphics. Pixels also matter, and ray tracing is O(n) with respect to pixels, and it gets worse for anti-aliasing. For FSAA you cast multiple rays per pixel, which means 4x FSAA has 4x the computation requirements. It turns out rasterization scales much better with regard to resolution and AA. In fact, these days 4x FSAA is often O(1) in actual implementation, in that turning it on doesn't hit the frame rate. That is also why ray tracing demos are low resolution: because THAT'S the hard part.

        2) Real hardware limits. In the case of ray tracing, it is memory access. While those polygons scale in a perfectly logarithmic fashion in theory, real system RAM isn't so forgiving. As you add more and more of them you run into RAM bandwidth limits and things slow down. All the memory access required becomes a limit that wasn't on paper. You can cry all you like that it wasn't a problem in theory; on actual hardware you run into other limits.

        People need to better understand that Big O is a theoretical tool for comparing the scaling of algorithms. That is useful, but you then have to consider the reality of the situation. CS people also need to understand that for users there is only one benchmark that matters: the clock on the wall. Whatever is faster is better. It doesn't matter what the theory says; if the hardware can do X faster than Y, then X is better according to users.

        • by chammy ( 1096007 ) on Monday August 30, 2010 @02:23AM (#33412964)

          This is why ray tracer demos love showing off spheres made of millions of polygons and so on. It is cheap to do. However, it turns out polygons aren't the only thing that matters for graphics.

          The spheres would most likely be represented as an equation, not a soup of polygons. It's much more efficient (not to mention a whole lot easier) to raytrace. It's also infinitely precise, which is actually why a lot of people are more interested in raytracing than approximating things with polygons.

          For instance, it's a heck of a lot easier to render isosurfaces [demon.co.uk] in a raytracer than turning them into a big polygon soup and rasterizing that.

        • by Thiez ( 1281866 ) on Monday August 30, 2010 @02:59AM (#33413024)

          I don't think it's a fair comparison, really. The only reason modern games look so pretty is because they have the GPU that happens to be designed to help them out. If companies had been designing GPUs to do raytracing for over a decade, it seems rather likely to me that we could do raytracing at modern resolutions.

          Wouldn't it be more fair to compare raytracing to rasterization when you're only allowed to use the CPU?

        • by smallfries ( 601545 ) on Monday August 30, 2010 @04:13AM (#33413222) Homepage

          Pixels also matter, and ray tracing is O(n) with respect to pixels, and it gets worse for anti-aliasing.

          Not sure what your point is there. Any algorithm for rendering that spits out n pixels at the end has an O(n) lower-bound.

        • by jonaskoelker ( 922170 ) <.moc.oohay. .ta. .rekleoksanoj.> on Monday August 30, 2010 @04:31AM (#33413272)

          People need to better understand that Big O is a theoretical tool for comparing the scaling of algorithms. That is useful, but you then have to consider the reality of the situation.

          Heck, why don't we computer scientists do the last step of this whole science thing and come up with a model that describes the actual reality more accurately?

          Oh, we do. Consider the I/O model, the cache-oblivious model or (for a fun combination) the parallel I/O model. (See for instance Aggarwal & Vitter: The Input/Output Complexity of Sorting and Related Problems.)

          if the hardware can do X faster than Y, then X is better according to users.

          Well, users as a collection run your algorithm with a distribution of inputs. You can't say that an algorithm is faster than another (either big-O or wall-clock) without talking about a particular case (worst, best, average) and distribution (diff works "better" on lines of code than paragraphs of latex/text; that's mostly a usability thing, but it illustrates different distributions of real-world data).

        • Re:Also (Score:2, Interesting)

          by Anonymous Coward on Monday August 30, 2010 @08:18AM (#33414156)

          You are missing the same part almost every other CS student out there misses

          K*O(N)+C

          There is usually some K involved that applies to every operation. This can turn a log n op into something huge. Or the setup (the C part) could take longer than just brute-forcing the issue. Many people make this mistake. However, hardware does not stand still either. People expect next year's graphics to be more photorealistic than last year's.

          You use raster vs ray (because you hate how long ray takes) as the reason you hate big O. Instead you should wonder: isn't raster just a special low-cost implementation of ray casting where the reflection depth is set to 1 or 2? Eventually ray will take over. It has to. There is only so much more pre-cooking you can do (which is what raster is). The results do look good and people use them to good effect. However, eventually you have to ray trace it. In fact, today you use many ray-casting methods, such as the z-buffer, per-vertex shading, per-pixel shaders, etc. These methods were invented to speed up ray casting. They work with raster as it is a special case of ray tracing, one where many of the pixels are pre-rendered. Eventually you cannot get the correct colors unless you figure out all the light sources in the scene. Raster also has a scaling problem: you can stretch the pixels out, but they turn into jaggy squares. Anti-aliasing helps some but can only do so much, as the information just does not exist. As screens get larger, with higher pixel depths and higher geometry rates, you eventually just cannot move the raster images around the system fast enough to keep up.

          How do people know these things? They use big O and the other constants to see what it is doing. If you think nVidia, AMD, and Intel are going to keep making raster-only systems from here till the end of time, you're dreaming. Intel can't make a graphics part, we all know this, but to date they have made the most compelling ray tracing one. nVidia and AMD have both stated that eventually it will be ray casting, as they know raster only works for so long. You will see more and more general, highly parallel, special-purpose GPUs coming from these folks. Eventually it will be easy to do ray casting in combination with raster to create very compelling graphics. Notice I didn't say raster goes away. Anything but; it will be around for a long time.

          Ray casting has a distinct long-term advantage over raster. It is done 'as code', meaning you are not shoving huge bitmaps through the system bus, but instead rendering directly into the video memory. The code has an advantage in that it can probably be loaded into, and fit in, the high-speed memory system cache. Bitmaps will need to get ever larger to accommodate better-looking graphics and higher polygon counts.

          Also, you use FSAA as a 'this looks better' when in fact in quite a few cases it looks worse. You forget the FS part of that: full screen. Meaning things that should not be anti-aliased are. Ray casting can take that into account PER object, PER pixel, depending on the z-buffer, and then decide only when it's needed. With ray tracing you don't need FSAA all the time, because you have already done the anti-aliasing at the ray-cast level.

          Also, raster image memory is not cheap either. It is why they built special buses to handle it. You know: AGP, PCIe, etc. It was a practical solution to what the industry is doing right now. However, if you notice, PCIe is not only for graphics but about bandwidth in general, while AGP was a special bus for graphics. They know there are things coming up that need more bandwidth, ray casting being one of many.

          Also, have you not wondered why the graphics chip guys have been in a huge hurry to make themselves into parallel processors? They do not want to be thrown to the side. The CPU would eventually be able to handle the raster images that they do today, with room to spare. They must make the graphics better and do it faster than a CPU could. It just so happens there are stacks of graphics algs t

        • Re:Also (Score:4, Interesting)

          by pz ( 113803 ) on Monday August 30, 2010 @08:33AM (#33414256) Journal

          People need to better understand that Big O is a theoretical tool for comparing the scaling of algorithms. That is useful, but you then have to consider the reality of the situation.

          Right. And any good programmer understands that *first* you pick the right algorithm, and *then* you optimize the implementation. Working the other way around is wasting time.

          But, more importantly, that the parent seems to miss, is that the speed improvement from changing the order of growth of an algorithm can swamp nearly any possible improvements in hardware. Going from O(n^4) to O(n^2) when n is 10,000 is like replacing your desktop box with ten thousand supercomputers. No amount of implementation tweaking or upgrading processors and memory -- which only affects the constants in algorithm speed -- is going to give you the same level of speedup.

          There is a very, very good reason to pay attention to O() analysis of algorithms and, in particular, of their implementations. You can implement bubble sort, which is O(n^2) when done correctly, in a brain-dead way that makes it O(n^3) --- if you, for example, re-extend the output array for each new element --- and the difference is huge. Extra factors of n creep in easily, and understanding them can be highly useful.

          So, the parent can review real-world constraints all he wants, but in the end, order of growth is more important.
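          A small sketch of the point above (hedged: illustrative code, not from TFA). Written carefully, bubble sort stays O(n^2); the brain-dead variant described would copy or re-extend the output array on every insertion, silently adding another factor of n.

```python
def bubble_sort(a):
    """Plain bubble sort, O(n^2) when done correctly.

    Note the single copy up front -- re-extending or rebuilding the
    output array once per element is how the extra factor of n sneaks in.
    """
    a = list(a)  # one copy, not one per element
    n = len(a)
    for i in range(n):
        swapped = False
        for j in range(n - 1 - i):
            if a[j] > a[j + 1]:
                a[j], a[j + 1] = a[j + 1], a[j]
                swapped = True
        if not swapped:
            break  # already sorted: early exit
    return a
```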

        • by pitchpipe ( 708843 ) on Monday August 30, 2010 @10:29AM (#33415498)

          You can cry all you like that it wasn't a problem in theory; on actual hardware you run into other limits.

          Reminds me of the quote:

          In theory there is no difference between theory and practice. In practice there is. - Yogi Berra

    • Re:Um... (Score:5, Interesting)

      by SpazmodeusG ( 1334705 ) on Sunday August 29, 2010 @10:36PM (#33412258)

      I wish they'd start putting the "P" into these Big-O notations, where the "P" is the number of processors. Some algorithms don't scale well, some do. Putting the P in illustrates this.

      eg. O( n/P ) illustrates an algorithm that scales perfectly with more cores added. O( n / log(P) ) not so much.

      • Re:Um... (Score:3, Funny)

        by Guido del Confuso ( 80037 ) on Monday August 30, 2010 @01:24AM (#33412778)

        O( n / log(P) ) not so much.

        That algorithm does particularly poorly on just one processor. In fact, if it ran successfully the universe would implode. [mathfail.com]

      • Re:Um... (Score:3, Interesting)

        by frosty_tsm ( 933163 ) on Monday August 30, 2010 @01:36AM (#33412816)
        While an interesting idea, many algorithms behave unpredictably on multiple processors (depending on how much communication would be required). Some will even be slower!

        The effect that extra CPUs will have is too dependent on the hardware implementation to be formalized like this.
      • Re:Um... (Score:3, Informative)

        by Anonymous Coward on Monday August 30, 2010 @01:43AM (#33412832)

        Typically, I hear researchers describe the parallelism of an algorithm separately from its computational complexity (big oh notation) using the idea of "speedup."

        The perfect scaling in your first example has linear speedup, and the second example has logarithmic speedup (that is, the speedup is log(p)).

        Here is the relevant Wikipedia article [wikipedia.org].
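        A tiny hedged sketch of the idea (illustrative numbers only): speedup is S(p) = T(1)/T(p), so an O(n/p) algorithm shows linear speedup while an O(n/log p) one shows only logarithmic speedup.

```python
import math

def speedup(t_serial, t_parallel):
    """Classic speedup: time on one processor over time on p processors."""
    return t_serial / t_parallel

n = 1_000_000  # illustrative amount of work; units are arbitrary
for p in (2, 4, 8, 16):
    linear_s = speedup(n, n / p)              # O(n/p): speedup equals p
    log_s = speedup(n, n / math.log2(p))      # O(n/log p): only log2(p)
```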

      • by Kjella ( 173770 ) on Monday August 30, 2010 @05:55AM (#33413472) Homepage

        Big-O notation does not describe the function on a physical processor with caches and pipelines; it's a pure mathematical concept counting mathematical operations. Doubling n may mean things no longer fit in L1/L2/L3/RAM, which has huge performance implications; there is no simple notation for an actual implementation.

        Adding processors would be just one factor; what's the latency and bandwidth of the interconnects? What memory can they access, and at what cost? NUMA plays a big role. There's a huge, huge difference between running something on a cluster (slow interconnects) and a supercomputer (fast interconnects), which is at least as big as the difference between CPU and GPU.

        I suppose you could model a theoretical parallelization with zero latency and infinite bandwidth, counting only the mathematical dependencies, but the results would be pretty much meaningless. Like this sorting algorithm: you could parallelize with hundreds of thousands of simultaneous compares - in theory. In practice the overhead of doing it would completely kill performance.

    • by Concerned Onlooker ( 473481 ) on Sunday August 29, 2010 @11:20PM (#33412386) Homepage Journal

      Obviously you weren't close enough to hear the sonic boom.

    • by Lobais ( 743851 ) on Monday August 30, 2010 @03:56AM (#33413164)

      It doesn't only depend on the hardware setup. It depends on the number of keys!
      Unless the algorithm is actually O(n) and not O(n log n), the rate drops with the increase in the number of keys to sort.

  • by PatPending ( 953482 ) on Sunday August 29, 2010 @09:59PM (#33412126)
  • by joeyadams ( 1724334 ) on Sunday August 29, 2010 @10:12PM (#33412168)
    I think the bubble sort would be the wrong way to go.

    —Barack Obama
  • by Murdoch5 ( 1563847 ) on Sunday August 29, 2010 @10:16PM (#33412180) Homepage
    Hold on, so I can play Jezz Ball, Chips Challenge and all my favourite old school games with a greater level of speed.
  • by jewishbaconzombies ( 1861376 ) on Sunday August 29, 2010 @10:17PM (#33412184)
    Thank god, because my online Porn collection isn't going to sort itself.
  • by crovira ( 10242 ) on Sunday August 29, 2010 @10:21PM (#33412204) Homepage

    you've got lots and lots of RAM with room for the keys and lots of space to waste for unfilled pointers.

    Pass 1, read the records, at the key radix store a record URI
    Pass 2, sweep RAM and read the record URIs in the order you encounter them copying them onto a sequential write device.

    I was doing this years ago.

    If you are careful with the number and sizes of your read buffers, the re-read done for pass 2 doesn't have to be all that disk intensive.

    You can even generate the keys using whatever hash function you find that is truly efficient, and store collisions separately (leave a bit high and go into a linked list maintained by the hash generator for those few keys that hash redundantly.)
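    An in-memory sketch of the two-pass scheme described above (hedged: the names, the direct key-to-slot addressing, and the "lots of space to waste" table are my illustrative stand-ins). Pass 1 scatters record indices into a table slot addressed by key; pass 2 sweeps the mostly empty table in order; colliding (equal) keys chain in a per-slot list.

```python
def two_pass_sort(records, key_of, max_key):
    """Pigeonhole-style sort: scatter by key, then sweep sequentially."""
    table = [None] * (max_key + 1)  # mostly-unfilled slots, as described
    for i, rec in enumerate(records):       # pass 1: scatter record indices
        k = key_of(rec)
        if table[k] is None:
            table[k] = []
        table[k].append(i)                  # equal keys chain in a list
    # pass 2: sweep the table and emit records in the order encountered
    return [records[i] for bucket in table if bucket for i in bucket]
```

    Like the scheme in the comment, this trades a huge sparse table for two linear passes, and the sweep is a purely sequential read.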

    • Well, yeah, they're not claiming they invented radix sort. They're claiming that their GPU implementation of radix sort runs about 4x as fast as the CPU implementation you describe.

      • by kramulous ( 977841 ) on Monday August 30, 2010 @02:18AM (#33412946)

        So, does that mean that if they went out and got the fastest Xeon processor (they used the fastest GPU - excluding the C2070), wrote it parallel, and used the Intel Compiler (tuned to it), the speedup over the CPU would be less than zero?

        After having just looked at their code, also remove the CPU STL shit (actually any template, since they don't vectorise). If you are writing speedy code for the GPU, then to perform an adequate test for the CPU you also have to write appropriately.

        hahahahaha

        This gets better and better...
        They only timed the compute time. cudaMalloc was not part of the timing, nor was cudaMemcpy.

        Sorry, I only count 'time to solution'. That is all I'm interested in. I thought it was strange that a less-than-O(n**2) algorithm was faster on the GPU.

        It is like writing benchmarks that ignore disk I/O, reading from system memory, the L3 cache, etc., and only time stuff that is already in the registers.

        • by julesh ( 229690 ) on Monday August 30, 2010 @06:33AM (#33413572)

          This gets better and better...
          They only timed the compute time. cudaMalloc was not part of the timing, nor was cudaMemcpy.

          Sorry, I only count 'time to solution'. That is all I'm interested in.

          Total time including initialisation and reporting result is only useful if what you're looking at is a complete process -- but when was the last time you just needed to sort something and that sorting wasn't part of a larger algorithm? It's at least not totally out of the question that the data will already be in GPU memory, and that the next step of processing can occur there too, at which point these (O(n)) overheads become irrelevant.

    • by evilWurst ( 96042 ) on Monday August 30, 2010 @02:19AM (#33412948) Journal

      It's generally not size of RAM that breaks radix sort; it's the size of cache. Modern processors are highly reliant on cache, which means they're highly reliant on things in memory being in small tight chunks that fit in cache - because cache misses are expensive enough that if you thrash cache badly enough, you may end up running slower than if you hadn't had any cache at all.

      Good comparison sorts may start fragmented, but by their very nature each pass of the algorithm makes them less so. Radix sort is the other way around; it follows pointers (so more precious scarce cache in use already) that point in more and more fragmented patterns with every pass. That's why even though radix sort's average speed is theoretically faster than quicksort's, quicksort still wins on real-life hardware. And that's probably why radix sort wins on GPUs - the data fits in the card's dedicated memory, which is already optimized to be accessed in a much more parallel way than main memory.

  • by Anonymous Coward on Sunday August 29, 2010 @11:04PM (#33412340)

    The typical sorting problem I've encountered in my career (various types of scientific, telecommunications and business software, though not games) involves an array of pointers to fixed length records that have a sort key (let's say an integer) at a predefined offset. Not an array of integers, nor an array of directly embedded small fixed length records which I'm guessing was used in TFA. The former situation requires random as well as stream access to memory, which would likely favor processing by the CPU in the motherboard of a typical $1000-$2000 PC.

  • Efficient? (Score:2, Interesting)

    by scotty.m ( 1881826 ) on Monday August 30, 2010 @12:19AM (#33412578)
    Not sure how they are defining efficiency in this experiment. I'd like to see how many clock cycles it took to sort each problem, and maybe how much memory the radix sort is using too.
  • by Mr Z ( 6791 ) on Monday August 30, 2010 @03:52AM (#33413152) Homepage Journal
    You know, if they up it to just a bit over 4 billion unique 32-bit keys, say around 4,294,967,296 or so, I think I could sort them rather efficiently, as long as they weren't attached to any payload. ;-)
  • by brunes69 ( 86786 ) <slashdot&keirstead,org> on Monday August 30, 2010 @05:45AM (#33413446)

    I suspect your problem is going to be getting the keys in and out of the GPU memory, not the time the GPU takes to sort them.
